Beyond Elementwise: The Shift to Tiled Matrix Operations
AI023 Lesson 9

In previous lessons, we focused on elementwise operations (like a basic ReLU on a matrix). These are memory-bound: the GPU spends more time moving data from HBM into registers than performing arithmetic.
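To see why elementwise kernels are memory-bound, we can compute their arithmetic intensity (FLOPs per byte moved). The sketch below uses illustrative accounting assumptions (one FLOP per `max`, a read plus a write per element), not vendor specs:

```python
# Back-of-envelope arithmetic intensity for an elementwise ReLU.
# The accounting (1 FLOP per element, read + write traffic) is an
# illustrative assumption, not a measured figure.

def relu_intensity(n_elements: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte moved for y = max(x, 0) over n_elements float32s."""
    flops = n_elements                              # one max per element
    bytes_moved = 2 * n_elements * bytes_per_elem   # read x, write y
    return flops / bytes_moved

print(relu_intensity(1 << 20))  # 0.125 FLOPs/byte
```

At 0.125 FLOPs per byte, the kernel is far below the tens of FLOPs per byte a modern GPU needs to keep its ALUs busy, so it idles waiting on HBM regardless of how cleverly the math is written.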

1. Why GEMM is Central

General Matrix Multiplication (GEMM) performs $O(N^3)$ floating-point operations while touching only $O(N^2)$ memory, so its arithmetic intensity grows with $N$. This lets us hide memory latency behind massive arithmetic throughput, making GEMM the "heartbeat" of LLMs.
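Contrast this with the ReLU case: under a simple (idealized) accounting where each input matrix is read once and the output written once, GEMM's FLOPs-per-byte ratio scales linearly with $N$:

```python
# Idealized arithmetic intensity for C = A @ B with N x N float32 matrices.
# Assumes A and B are each read once and C written once -- an upper bound
# that tiling tries to approach in practice.

def gemm_intensity(n: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte moved for an N x N GEMM."""
    flops = 2 * n ** 3                          # N^3 multiply-adds
    bytes_moved = 3 * n ** 2 * bytes_per_elem   # read A, read B, write C
    return flops / bytes_moved

print(gemm_intensity(1024))  # ~170.7 FLOPs/byte
```

Even at a modest $N = 1024$, the intensity is three orders of magnitude above the ReLU kernel's 0.125 FLOPs/byte, which is why GEMM can be compute-bound while elementwise ops cannot.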

2. 2D Memory Representation

Physical RAM is 1D. To represent a 2D tensor, we use Strides. A common production pitfall is assuming a tensor is contiguous. If you mix up row and column strides in your pointer math, you will access "ghost" data or trigger memory violations.
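NumPy exposes the same stride machinery, so we can demonstrate the pitfall without writing a kernel. Note that a transpose changes strides without copying data, which is exactly when a "contiguous" assumption breaks:

```python
import numpy as np

# A contiguous 3x4 float32 tensor: row stride 16 bytes, column stride 4 bytes.
a = np.arange(12, dtype=np.float32).reshape(3, 4)
print(a.strides)                 # (16, 4)

# Transposing returns a VIEW: the strides swap, but memory is untouched.
t = a.T
print(t.strides)                 # (4, 16)
print(t.flags['C_CONTIGUOUS'])   # False -- assuming contiguity here is the bug

# The pointer math a kernel must get right:
# element (i, j) lives at base + i * stride0 + j * stride1.
i, j = 2, 1
flat = a.ravel()
idx = (i * a.strides[0] + j * a.strides[1]) // a.itemsize
assert flat[idx] == a[i, j]      # swap the strides and you read "ghost" data
```

Swapping `strides[0]` and `strides[1]` in the index calculation silently reads the wrong (but still valid) elements for a square tensor, and walks off the end of the buffer for a non-square one, which is why the failure mode is sometimes garbage values and sometimes a memory violation.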

3. Tiled Generalization

Triton generalizes elementwise logic by shifting from single pointers to blocks of pointers. By using 2D tiles (e.g., $16 \times 16$), we exploit data reuse in high-speed SRAM, keeping data "hot" for fused operations like Bias addition or activations before writing back to Global Memory.
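The tiling-plus-fusion pattern can be sketched in plain NumPy. This is a helper illustration of the data flow, not Triton itself: each output tile is accumulated across the K dimension while it stays "hot", and the bias-add and ReLU epilogue are fused in before the single write-back (the hypothetical `tiled_matmul_fused` name and the `BLOCK=16` default are my choices for this sketch):

```python
import numpy as np

def tiled_matmul_fused(A, B, bias, BLOCK=16):
    """C = relu(A @ B + bias), computed one (BLOCK x BLOCK) tile at a time.

    The accumulator `acc` plays the role of the tile held in SRAM by a
    Triton kernel: partial products are summed over K, then the bias and
    activation are applied before the one write to global memory.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for m0 in range(0, M, BLOCK):
        for n0 in range(0, N, BLOCK):
            acc = np.zeros((min(BLOCK, M - m0), min(BLOCK, N - n0)),
                           dtype=A.dtype)
            for k0 in range(0, K, BLOCK):        # accumulate over K
                acc += A[m0:m0+BLOCK, k0:k0+BLOCK] @ B[k0:k0+BLOCK, n0:n0+BLOCK]
            # Fused epilogue: bias + ReLU while the tile is still hot.
            C[m0:m0+BLOCK, n0:n0+BLOCK] = np.maximum(acc + bias[n0:n0+BLOCK], 0)
    return C
```

Each element of A and B is loaded into the hot tile once per opposing tile rather than once per output element, which is the data reuse that lifts a GEMM from memory-bound toward the compute-bound intensity computed earlier.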

[Diagram: 1D linear layout vs. 2D tiled layout]